Chinese Abbreviation Identification Using Abbreviation-Template

Author

  • Houfeng Wang
Abstract

Chinese abbreviations are frequently used without being defined, which creates considerable difficulty for NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To identify new abbreviations from existing ones, we add generalization capability to the abbreviation lexicon by replacing words with word classes, thereby creating abbreviation-templates. An SVM classifier is trained on abbreviation-template features together with context information. Evaluation on a raw Chinese corpus yields encouraging performance. Our experiments further demonstrate improvements from integrating extended word clustering (which we design to enable joint learning of word classes), morphological analysis, substring analysis and person name identification. To our knowledge, this is the first definition-independent machine learning approach for Chinese abbreviation identification.
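
The template idea in the abstract can be illustrated with a toy sketch. The abstract does not specify how templates are actually defined or matched, so everything here is an assumption for illustration only: the word-to-class map `WORD_CLASS`, the class labels `C1` to `C3`, the one-entry lexicon, and the function names are all hypothetical. The sketch only shows how replacing words with word classes lets a lexicon entry match an unseen candidate.

```python
# Hypothetical word-to-class map, e.g. the output of some word clustering step.
# The class labels are arbitrary placeholders, not taken from the paper.
WORD_CLASS = {"北": "C1", "南": "C1", "大": "C2", "的": "C3"}


def to_template(words, word_class=WORD_CLASS):
    """Replace each word with its class; words without a class stay unchanged."""
    return tuple(word_class.get(w, w) for w in words)


def build_templates(lexicon):
    """Generalize every lexicon entry into a template of word classes."""
    return {to_template(entry) for entry in lexicon}


def template_feature(candidate, templates):
    """Binary feature: does the candidate's class sequence match a known template?"""
    return to_template(candidate) in templates


# Toy lexicon with a single known abbreviation, given as a sequence of words.
lexicon = [("北", "大")]
templates = build_templates(lexicon)

print(template_feature(("南", "大"), templates))  # unseen candidate, same classes
print(template_feature(("的", "大"), templates))  # different class sequence
```

In a setup like the one the abstract describes, a feature of this kind would be one input among others (such as context information) to the classifier, rather than a decision rule on its own.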

Similar resources

Chinese Abbreviation Identification Using Abbreviation-Template Features and Context Information

Chinese abbreviations are frequently used without being defined, which has brought much difficulty into NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To meet our aim of identif...

A Chinese Dataset with Negative Full Forms for General Abbreviation Prediction

Abbreviation is a common phenomenon across languages, especially in Chinese. In most cases, if an expression can be abbreviated, its abbreviation is used more often than its fully expanded form, since people tend to convey information in the most concise way. For various language processing tasks, abbreviation is an obstacle to improving performance, as the textual form of an abbreviation do...

Automatic Chinese Abbreviation Generation Using Conditional Random Field

Dong Yang, Yi-cheng Pan, and Sadaoki Furui, Department of Computer Science, Tokyo Institute of Technology, Tokyo 152-8552, Japan, {raymond,thomas,furui}@furui.cs.titech.ac.jp. Boulder, Colorado, June 2009. ©2009 Association for Computational Linguistics. Abstract: This paper presents a new method for automatically generating abb...

Cluster based Chinese abbreviation modeling

Abbreviations are widely observed in spoken Chinese. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. Abbreviation generation is treated as a character-based tagging problem. Due to limited training data, Chinese abbreviation generation suffers from data sparseness. Two types of strat...

Vocabulary expansion through automatic abbreviation generation for Chinese voice search

Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. The generation of Chinese abbreviations is much more complex than that of English abbreviations, most of which are acronyms and truncations. In this paper, we propose a new method for automatically generating abbreviations for Chinese named en...

Publication date: 2006